
admin api: Serial Console Retrieval #4257

Open
rajdeepc2792 wants to merge 13 commits into main from rajdeepc2792/ARO-17482

Conversation

Collaborator

@rajdeepc2792 rajdeepc2792 commented Feb 27, 2026

https://issues.redhat.com/browse/ARO-17482

What

Introduces an Admin API serial console endpoint that gives SREs access to VM serial console logs for debugging boot issues, kernel panics, and other low-level problems in an HCP cluster's NodePool VMs.

GET /admin/v1/hcp{resourceId}/serialconsole?vmName={vmName}
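For illustration, a caller could assemble the request URL like this. Note this is a sketch: `serialConsoleURL` is a hypothetical helper and the host and resource ID below are made-up, not part of the PR.

```go
package main

import (
	"fmt"
	"net/url"
)

// serialConsoleURL builds the admin API request URL for a given HCP
// resource ID and VM name, following the route shape in this PR.
// The helper itself is illustrative, not part of the change.
func serialConsoleURL(host, resourceID, vmName string) string {
	q := url.Values{}
	q.Set("vmName", vmName)
	return fmt.Sprintf("https://%s/admin/v1/hcp%s/serialconsole?%s", host, resourceID, q.Encode())
}

func main() {
	fmt.Println(serialConsoleURL(
		"admin.example.com",
		"/subscriptions/sub/resourceGroups/rg/providers/Microsoft.RedHatOpenShift/hcpOpenShiftClusters/mycluster",
		"my-nodepool-vm-0",
	))
}
```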

Why

  • Debugging boot failures in worker nodes
  • Investigating kernel panics
  • Diagnosing networking issues preventing node readiness
  • Troubleshooting systemd service failures during startup
  • Post-mortem analysis of node crashes

Special notes for your reviewer

Testing

@openshift-ci openshift-ci bot requested review from geoberle and miquelsi February 27, 2026 20:35
Copilot AI review requested due to automatic review settings March 6, 2026 13:08
Contributor

Copilot AI left a comment


Pull request overview

Adds an Admin API endpoint to retrieve Azure VM serial console logs for HCP clusters, enabling SRE debugging via boot diagnostics/serial console output.

Changes:

  • Introduces GET /admin/v1/hcp{resourceId}/serialconsole?vmName=... handler that calls Azure Compute boot diagnostics APIs and streams the serial console log blob back to the caller.
  • Adds vmName validation helper + unit tests, and refactors shared Cluster Service error handling for correct 404 behavior.
  • Adds E2E coverage and supporting test framework helpers for discovering a VM and toggling boot diagnostics.

Reviewed changes

Copilot reviewed 15 out of 17 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
admin/server/server/admin.go Registers the new /serialconsole HCP admin route.
admin/server/handlers/hcp/serialconsole.go Implements the serial console retrieval handler (boot diagnostics lookup + blob download/stream).
admin/server/handlers/hcp/serialconsole_test.go Adds handler-level unit tests for validation and dependency error paths.
admin/server/handlers/hcp/helpers.go Adds reusable ClusterServiceError helper to correctly translate CS 404s.
admin/server/handlers/hcp/helpers_test.go Unit tests for ClusterServiceError.
admin/server/handlers/hcp/breakglass/create.go Switches to shared ClusterServiceError helper (removes local duplicate).
internal/validation/validators.go Adds Azure VM name regex + exported validation helper.
internal/validation/validate_vmname.go Adds tests for VM name validation (currently added as a non-*_test.go file).
test/util/framework/admin_api_helpers.go Adds test helpers for calling the new endpoint and manipulating/listing VMs via ARM compute.
test/e2e/admin_api.go Adds E2E tests for serial console happy path and “boot diagnostics disabled” conflict case.
test/testdata/zz_fixture_TestMainListSuitesForEachSuite_rp_api_compat_all_parallel_01rp_api_compat_all_parallel_development.txt Updates suite listing fixture to include the new SRE serial console test.
test/testdata/zz_fixture_TestMainListSuitesForEachSuite_dev_cd_check_paralleldev_cd_check_parallel.txt Updates suite listing fixture to include the new SRE serial console test.
admin/server/go.mod Adds Azure compute SDK dependency (and adjusts mock dependency classification).
admin/server/go.sum Updates sums for new/updated dependencies.
test-integration/go.mod Adds armcompute/v5 as an indirect dependency (appears unused within test-integration/).
test-integration/go.sum Updates sums for new/updated dependencies.
admin/README.md Documents the new admin endpoint in the API table.
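The table above mentions a new Azure VM name regex and exported validation helper in internal/validation. Since the actual pattern isn't visible in this conversation, here is a hedged sketch of what such a helper could look like; the regex below is an assumption for illustration, not the PR's rule.

```go
package main

import (
	"fmt"
	"regexp"
)

// vmNameRE is an illustrative stand-in for the PR's Azure VM name
// regex (the real pattern isn't shown here): start and end with an
// alphanumeric character, hyphens allowed in between, 1-64 chars.
var vmNameRE = regexp.MustCompile(`^[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,62}[a-zA-Z0-9])?$`)

// ValidateVMName mirrors the shape of an exported validation helper.
func ValidateVMName(name string) error {
	if !vmNameRE.MatchString(name) {
		return fmt.Errorf("invalid VM name %q", name)
	}
	return nil
}

func main() {
	fmt.Println(ValidateVMName("aro-hcp-nodepool-vm-0")) // <nil>
	fmt.Println(ValidateVMName("-bad-name"))
}
```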


@rajdeepc2792 rajdeepc2792 force-pushed the rajdeepc2792/ARO-17482 branch 2 times, most recently from a0834b8 to bbe2bd9 Compare March 6, 2026 13:52
Copilot AI review requested due to automatic review settings March 6, 2026 13:52
@rajdeepc2792 rajdeepc2792 force-pushed the rajdeepc2792/ARO-17482 branch from bbe2bd9 to 64f0ced Compare March 6, 2026 13:59
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 17 changed files in this pull request and generated 3 comments.



@rajdeepc2792 rajdeepc2792 force-pushed the rajdeepc2792/ARO-17482 branch from 64f0ced to 207848e Compare March 6, 2026 14:07
Copilot AI review requested due to automatic review settings March 6, 2026 14:25
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 17 changed files in this pull request and generated 4 comments.



@rajdeepc2792
Collaborator Author

/test e2e-parallel


// download blob content with timeout to avoid stuck handlers on slow blob endpoints
httpClient := &http.Client{
Timeout: 30 * time.Second,
Member


Why use a client-side timeout instead of plumbing contexts?

Collaborator Author


Removed the client-side timeout

Collaborator


@bennerv should we still do a context.WithTimeout to be explicit about how long we want to wait here? Otherwise we just get whatever the request context defines.

Collaborator Author


I haven't added the context.WithTimeout yet and can wait for the decision. Alternatively, we could handle it in middleware at some point if we want a default request timeout?

Collaborator

@SudoBrendan SudoBrendan Mar 25, 2026


My vote:

  1. We 100% should add middleware with some kind of timeout (5 minutes?) as a stop-gap
  2. As needed for optimization, individual handlers can/should restrict that timeout even further to exit earlier so SREs aren't kept waiting on reasonably-known failure states for a particular task.

@rajdeepc2792 rajdeepc2792 force-pushed the rajdeepc2792/ARO-17482 branch from 77f6ef2 to c06f4c5 Compare March 18, 2026 13:58
Copilot AI review requested due to automatic review settings March 18, 2026 14:05
@rajdeepc2792 rajdeepc2792 force-pushed the rajdeepc2792/ARO-17482 branch from c06f4c5 to 2302601 Compare March 18, 2026 14:05
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 17 changed files in this pull request and generated 5 comments.




@rajdeepc2792 rajdeepc2792 force-pushed the rajdeepc2792/ARO-17482 branch from 2302601 to 47d7e02 Compare March 18, 2026 14:38
Copilot AI review requested due to automatic review settings March 20, 2026 15:56
@rajdeepc2792 rajdeepc2792 force-pushed the rajdeepc2792/ARO-17482 branch from 47d7e02 to 87f6c6a Compare March 20, 2026 15:56
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 17 changed files in this pull request and generated no new comments.



Copilot AI review requested due to automatic review settings March 24, 2026 15:36
@rajdeepc2792 rajdeepc2792 force-pushed the rajdeepc2792/ARO-17482 branch from ccadc3f to 7d8cf3a Compare March 24, 2026 15:36
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 17 changed files in this pull request and generated 4 comments.



@rajdeepc2792 rajdeepc2792 force-pushed the rajdeepc2792/ARO-17482 branch from 7d8cf3a to ff7df37 Compare March 24, 2026 16:09
Collaborator

@tony-schndr tony-schndr left a comment


looks good overall

Copilot AI review requested due to automatic review settings March 25, 2026 12:17
@rajdeepc2792 rajdeepc2792 force-pushed the rajdeepc2792/ARO-17482 branch from ff7df37 to c715df0 Compare March 25, 2026 12:17
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 17 changed files in this pull request and generated 3 comments.



// specific CloudError. This prevents ReportError from misinterpreting it as
// "HCP resource not found" (the HCP was already found in the database).
// Non-OCM errors are wrapped for ReportError to handle as internal errors.
func ClusterServiceError(err error, what string) error {
Collaborator Author


Initially in this PR this was used by the serial console implementation, but since it no longer uses it, it could be moved back to breakglass. I decided to keep it here for any similar future use. Please comment if anyone objects to moving the function here in this PR.

Collaborator

@SudoBrendan SudoBrendan left a comment


I think that overall, this /lgtm - if you want to address any of these (or do a follow-up) that's just as well. Thanks Raj!

http.StatusConflict,
arm.CloudErrorCodeConflict,
"",
"Diagnostics might be disabled for VM %s", vmName,
Collaborator


nit: as an SRE working a 2am incident - if I saw this error message raw I might be confused and need to review the code to understand the error. Adding a bit of clarity to what this error actually means (and how to resolve it?) for on-call could be helpful if that's even possible. I think this is especially true considering we never expect diagnostics to be off - the oddest cases likely deserve the clearest messaging.

Comment on lines +181 to +182
limitedReader := io.LimitReader(blobResp.Body, maxBytes)
_, err = io.Copy(writer, limitedReader)
Collaborator


nit: should we write a warning or something if data is truncated at 100MB? Note - this truncates to the beginning rather than the end of the content. Should we instead return the end of the log?

Collaborator Author


Good suggestion. I will address it in the next PR, where I handle streaming of the logs.

writer.Header().Set("Cache-Control", "no-cache, no-store, must-revalidate")
writer.Header().Set("Pragma", "no-cache")
writer.Header().Set("Expires", "0")
writer.WriteHeader(http.StatusOK)
Collaborator

@SudoBrendan SudoBrendan Mar 25, 2026


bug: once you write a header, it can NEVER be overwritten. The errors at L183 are silently ignored and only go to logs, but SREs are getting 200/OK (and potentially partial data) with the WriteHeader placed here.

I think the best way to handle this is to buffer the full content and only respond once we have the data in hand. If we're capping at 100MB, I think this is fine?

Collaborator Author


Yes, that's true, and that's the idea behind streaming the logs.

Collaborator

@SudoBrendan SudoBrendan Mar 25, 2026


I guess I meant: streaming can fail, and there's no great way to handle those failures and notify on-call about them. Considering we're only dealing with 100MB, it might be best to eliminate streaming altogether and only respond once we've got the full data, so we can return a 200 safely. That has implications on memory for admin-api for sure, but it seems reasonable, and I don't think we expect to call this endpoint often (e.g. I don't see us invoking this on every ICM; it's just a critical gap we need to fill in rare situations).

Ultimately it's not a huge deal either way, since the scenarios are slim, but considering that (silent) failures here could lead to incorrect assumptions during incident triage, it seems important to be more correct than resource-aware?

Expect(err.Error()).To(ContainSubstring("400"))
})

It("should return 409 when boot diagnostics is disabled on a VM",
Collaborator

@SudoBrendan SudoBrendan Mar 25, 2026


could we save a cluster create and NP create cycle by adding the "fetch VM -> disable -> 409" to the previous suite? I guess that conflates positive/negative ... but this feels heavy for a single HTTP test. Could it be an integration test?

Collaborator Author


I considered doing it in a single suite, but separated them because the disabling case requires an external action on the VM to turn off boot diagnostics, which might not work in environments with a deny assignment in place. The previous suite works in any environment.


openshift-ci bot commented Mar 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rajdeepc2792, SudoBrendan
Once this PR has been reviewed and has the lgtm label, please assign janboll for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
